GROUP 12: Alicja Wilk | Nijanth Anand | Qinyu Xia | Wai Phyo | Yiting Fu
First, let's import the necessary packages and load in the two data sets we'll use for the entirety of this assignment (products and copurchase).
library(car)
library(dplyr)
library(tidyr)
library(igraph)
library(ggplot2)
library(corrplot)
copurchase <- read.csv("copurchase.csv", stringsAsFactors = FALSE)
products <- read.csv("products.csv", stringsAsFactors = FALSE)
Products contains information about each individual product with the following descriptors:
head(products)
Copurchase contains the co-purchase information where a "source" product sale also leads to a "target" product sale.
head(copurchase)
First, we will define the boundary of the co-purchase network to only products that are books and have a salesrank value between 0 and 150,000 inclusive.
books.products <- filter(products, group == "Book"
& salesrank <= 150000 & salesrank >= 0)
books.copurchase <- filter(copurchase, Source %in% books.products$id
& Target %in% books.products$id)
We will then find the indegree of each product, i.e. how many edges point to each product in the co-purchase network.
indegree.df <- summarize(group_by(books.copurchase, Target), indegree = n())
names(indegree.df)[1]<-"id"
head(indegree.df)
Likewise, for the outdegree of each product.
outdegree.df <- summarize(group_by(books.copurchase, Source), outdegree = n())
names(outdegree.df)[1]<-"id"
head(outdegree.df)
Now we merge the two created dataframes (indegree.df and outdegree.df) back into our original products dataframe while creating a new column "degree" which is the sum of indegree and outdegree.
books.graph <- merge(books.products, indegree.df, by="id", all.x = TRUE)
books.graph <- merge(books.graph, outdegree.df, by="id", all.x = TRUE)
books.graph$indegree[is.na(books.graph$indegree)] <- 0
books.graph$outdegree[is.na(books.graph$outdegree)] <- 0
books.graph <- mutate(books.graph, degree = indegree + outdegree)
head(books.graph)
We then pick a product with the highest degree in order to find its subcomponent, i.e. all the products that are connected to this focal product.
However, there seems to be a tie with two products (33 and 4429), both having a degree of 53. We chose to work with 4429 as our focal product.
filter(books.graph, degree == max(books.graph$degree))
Next, we construct a subgraph using a subcomponent with product 4429 as its root node.
The resulting subgraph has 904 vertices and 1,173 edges.
g <- graph_from_data_frame(books.copurchase, directed = TRUE)
sg <- induced_subgraph(g, subcomponent(g, "4429", "all"), impl = "auto")
sg <- simplify(sg, remove.multiple = FALSE, remove.loops = TRUE)
V(sg)
E(sg)
Its diameter (longest geodesic) traverses 10 vertices, running from vertex 37895 to vertex 6676.
diameter <- get_diameter(sg)
diameter
V(sg)$color <- ifelse(V(sg)$name %in% diameter$name, "red", "lightblue")
V(sg)["4429"]$color <- "green"
V(sg)["33"]$color <- "gold"
E(sg)$color <- "darkgray"
E(sg,path=diameter)$color <- "red"
E(sg)$width <- 1
E(sg,path=diameter)$width <- 3
options(repr.plot.width = 100, repr.plot.height = 100)
plot(sg, layout=layout_with_fr, vertex.size=1, vertex.label=NA, edge.arrow.size=0.05)
Insights from the subgraph visualization: the focal product 4429 (green) and the other highest-degree product 33 (gold) are highlighted, and the diameter path is drawn in red.
Next, we can look at the cumulative degree distribution of our subcomponent.
degree.distr <- degree_distribution(sg, cumulative = TRUE, mode = "all")
options(repr.plot.width = 10, repr.plot.height = 10)
p <- ggplot() +
geom_line(aes(x = 0:(length(degree.distr) - 1), y = 1 - degree.distr), color = "darkgray") +
geom_point(aes(x = 0:(length(degree.distr) - 1), y = 1 - degree.distr), color = "darkorange") +
labs(title = "Cumulative Degree Distribution of Subcomponent", x = "Degree", y = "Cumulative Frequency") +
scale_x_continuous(breaks = seq(0, length(degree.distr) - 1, by = 5)) +
theme_minimal()
print(p)
We see that the distribution is heavily right-skewed: most vertices have only a few co-purchase links. Moreover, the cumulative frequency approaches 1 as the degree approaches 20, i.e. very few vertices have a degree above 20.
As for the edge density of our subgraph, i.e. the proportion of present edges from all possible edges in the network, the value equates to:
edge_density(sg)
This value indicates that our subgraph is quite sparse; in terms of product co-purchases, Amazon would ideally want a highly interconnected network to drive growth and engagement.
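As a quick sanity check, the density of a directed graph without loops is simply $|E| / (n(n-1))$; a minimal sketch, assuming the subgraph `sg` from above:

```r
# Directed density by hand: observed edges over all possible ordered pairs
ecount(sg) / (vcount(sg) * (vcount(sg) - 1))  # compare with edge_density(sg)
```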
Next, we can calculate network level centralization indices for our subgraph:
cat("Network level degree centralization index: ", centr_degree(sg)$centralization)
Network level degree centralization index is calculated using:
$Cd = \frac{\sum_{i=1}^{n} [cd^* - cd_i]}{\max\sum_{i=1}^{n}[cd^* - cd_i]}$
Where $cd^* = \max(cd_1, cd_2, \ldots, cd_n)$
Hence, our subgraph has nodes that share a similar degree centrality.
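The index can also be computed directly from the formula above; a minimal sketch, assuming the subgraph `sg` from before (`centr_degree_tmax()` supplies the theoretical-maximum denominator):

```r
# Numerator: sum of differences from the most central vertex
cd.all <- degree(sg, mode = "all")
num <- sum(max(cd.all) - cd.all)
# Denominator: theoretical maximum of that sum for a graph of this size
den <- centr_degree_tmax(sg, mode = "all")
num / den  # compare with centr_degree(sg)$centralization
```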
cat("Network level closeness centralization index: ", centr_clo(sg, mode = "all")$centralization)
The network-level closeness centralization index is calculated analogously:
$Cc = \frac{\sum_{i=1}^{n} [cc^* - cc_i]}{\max\sum_{i=1}^{n}[cc^* - cc_i]}$
Where $cc^* = \max(cc_1, cc_2, \ldots, cc_n)$
Hence, our subgraph has nodes that share a similar closeness score.
cat("Network level betweenness centralization index: ", centr_betw(sg, directed = TRUE)$centralization)
The network-level betweenness centralization index is calculated as:
$Cb = \frac{\sum_{i=1}^{n} [cb^* - cb_i]}{\max\sum_{i=1}^{n}[cb^* - cb_i]}$
Where $cb^* = \max(cb_1, cb_2, \ldots, cb_n)$
Hence, our subgraph has nodes that share a similar betweenness score.
cat("Network level eigenvector centralization index: ", centr_eigen(sg, directed = TRUE)$centralization)
The network-level eigenvector centralization index measures how proportional the centrality of each vertex is to the sum of the centralities of its neighbors,
i.e. a central actor is one connected to other central actors (influence).
Hence, our graph has nodes that are connected to many nodes that themselves have high centrality scores.
Now we can add the vertex-level centrality scores (above) to each vertex, along with their hub and authority scores.
sub.books <- filter(books.graph, id %in% V(sg)$name)
cd <- data.frame("id" = V(sg)$name, "centr_clo" = centr_clo(sg, mode = "all")$res,
"centr_betw" = centr_betw(sg, directed = TRUE)$res, "centr_eigen" = centr_eigen(sg, directed = TRUE)$vector,
"hub_score" = hub_score(sg, weights=NA)$vector, "auth_score" = authority_score(sg, weights=NA)$vector)
sub.books <- merge(sub.books, cd, by = "id")
Next, we calculate the mean rating, salesrank, and review count of each product's in-neighbors and add them as columns.
sub.books$nghb_mn_rating <- 0
sub.books$nghb_mn_salesrank <- 0
sub.books$nghb_mn_review_cnt <- 0
for (i in 1:nrow(sub.books))
{
# Average the rating, salesrank, and review count of each product's in-neighbors
curr.id <- as.character(sub.books[i, ]$id)
nbors <- filter(sub.books, id %in% neighbors(sg, curr.id, mode = "in")$name)
sub.books[i, "nghb_mn_rating"] <- mean(nbors$rating)
sub.books[i, "nghb_mn_salesrank"] <- mean(nbors$salesrank)
sub.books[i, "nghb_mn_review_cnt"] <- mean(nbors$review_cnt)
}
select(sub.books, id, salesrank, review_cnt, rating, degree,
centr_clo, centr_betw, centr_eigen, hub_score, auth_score,
nghb_mn_rating, nghb_mn_salesrank, nghb_mn_review_cnt) %>% head()
Before fitting our Poisson regression, it's good practice to examine the correlations between our predictors.
Here, we'll notice a couple of interesting linear relationships:
sb.model <- select(sub.books, salesrank, review_cnt, rating, degree,
centr_clo, centr_betw, centr_eigen, hub_score, auth_score,
nghb_mn_rating, nghb_mn_salesrank, nghb_mn_review_cnt)
sb.model[is.na(sb.model)] <- 0
M <- cor(sb.model)
corrplot(M, method = "circle")
For a first attempt at our Poisson regression model, we'll include all of our predictors as-is.
summary(p.model.1 <- glm(salesrank ~ review_cnt + rating + degree +
centr_clo + centr_betw + centr_eigen + hub_score + auth_score +
nghb_mn_rating + nghb_mn_salesrank + nghb_mn_review_cnt,
family="poisson", data = sb.model))
All the predictors seem to be statistically significant including the intercept.
What about its Model Evaluation Metrics?
cat("Mean Absolute Error", mean(abs(residuals(p.model.1))))  # note: mad() would give the median absolute deviation, not the MAE
cat("Mean Squared Error", mean(residuals(p.model.1)^2))
cat("Root Mean Squared Error", sqrt(mean(residuals(p.model.1)^2)))
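For reference, these three metrics are computed on the model's residuals $r_1, \ldots, r_n$ (for a glm, residuals() returns deviance residuals by default):
$MAE = \frac{1}{n}\sum_{i=1}^{n}|r_i|, \quad MSE = \frac{1}{n}\sum_{i=1}^{n}r_i^2, \quad RMSE = \sqrt{MSE}$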
One could be satisfied with the model fitted above, given its relatively low error metrics.
But, before we decide whether this regression model is a good fit, we will need to conduct residual analysis.
options(warn=-1)
par(mfrow = c(2, 2))
plot(p.model.1)
Above, we are concerned with two of the plots: Residuals vs Fitted and Scale-Location.
Ideally, we want our residuals to be randomly scattered around zero with roughly constant variance.
However, our residual plots appear skewed and unbalanced.
Hence, we have a couple of ways to improve our regression model: check for multicollinearity among the predictors, and log-transform any heavily skewed ones.
To detect multicollinearity, we can calculate each predictor's Variance Inflation Factor (VIF).
Ideally, we want the VIFs of our predictors to be below 5; a VIF above 10 indicates that two or more predictors are likely correlated with each other.
vif(p.model.1)
All of our predictors' VIFs are below 5, meaning there is no multicollinearity issue and no predictor needs to be removed.
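For intuition, the VIF of a predictor can be approximated by hand: regress it on the remaining predictors and apply $VIF = 1/(1 - R^2)$. A sketch for one predictor, chosen here arbitrarily (note that car::vif on a glm reports generalized VIFs, which can differ slightly):

```r
# Regress review_cnt on all other predictors (salesrank is the response, so exclude it)
r2 <- summary(lm(review_cnt ~ . - salesrank, data = sb.model))$r.squared
1 / (1 - r2)  # compare with vif(p.model.1)["review_cnt"]
```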
Next, we visualize the histograms of our predictors (to see if they are skewed).
ggplot(gather(sb.model), aes(value)) +
geom_histogram(bins = 50) +
facet_wrap(~key, scales = 'free_x')
We see that all of them except rating and closeness centrality are heavily skewed.
We can now revise our Poisson regression model by taking logs of all the predictors except rating and closeness centrality.
summary(p.model.2 <- glm(salesrank ~ log(review_cnt + 1) + rating + log(degree + 1) + centr_clo + log(centr_betw + 1) +
log(centr_eigen + 1) + log(hub_score + 1) + log(auth_score + 1) + log(nghb_mn_rating + 1) +
log(nghb_mn_salesrank + 1) + log(nghb_mn_review_cnt + 1), family="poisson", data = sb.model))
cat("Mean Absolute Error", mean(abs(residuals(p.model.2))))
cat("Mean Squared Error", mean(residuals(p.model.2)^2))
cat("Root Mean Squared Error", sqrt(mean(residuals(p.model.2)^2)))
We see that the revised model has a lower Root Mean Squared Error, albeit with MAE and MSE similar to the previous model's.
Also, all the predictors, along with the intercept, remain statistically significant.
Moving on to the residual analysis...
par(mfrow = c(2, 2))
plot(p.model.2)
Now, though not completely perfect, we can see that the residual plots are closer to what is desired (randomly distributed).
Hence, the revised Poisson model is a better fit than the first model and is preferred.